Comparative Analysis of TMP Databases¶
Project: Digital Teotihuacan Mapping Project (TMP) - Phase 1
Objective: This notebook synthesizes the results from the entire profiling pipeline to conduct a comparative analysis of the four legacy databases and the two wide-format benchmark databases. Its primary goal is to use quantitative data to compare these database architectures on three key axes: Structural Complexity, Resource Usage, and Query Performance.
The findings from this notebook will directly inform the final recommendation for the Phase 2 unified database architecture.
1. Setup and Configuration¶
import pandas as pd
import numpy as np
from pathlib import Path
from IPython.display import display, Markdown
import plotly.express as px
import plotly.graph_objects as go
from sklearn.preprocessing import MinMaxScaler
import os
# --- Path Definitions ---
# Use more robust path construction based on notebook file location
try:
# Try to get notebook's directory (works in most Jupyter environments)
NOTEBOOK_DIR = Path(os.getcwd())
# Find the project root by looking for specific markers
current_path = NOTEBOOK_DIR
while current_path != current_path.parent:
if (current_path / "TASKS.md").exists() or (current_path / "pyproject.toml").exists():
PROJECT_ROOT = current_path
break
current_path = current_path.parent
else:
# Fallback: assume standard structure
PROJECT_ROOT = NOTEBOOK_DIR.parent.parent.parent.parent
REPORTS_DIR = PROJECT_ROOT / "phases" / "01_LegacyDB" / "outputs" / "reports"
except Exception as e:
print(f"Warning: Path detection failed ({e}). Using fallback path construction.")
# Fallback to original approach
PROJECT_ROOT = Path.cwd().parent.parent
REPORTS_DIR = PROJECT_ROOT / "outputs" / "reports"
# --- Styling and Display Options ---
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', '{:.2f}'.format)
def display_header(title):
display(Markdown(f'### {title}'))
print("✅ Setup complete.")
print(f"Project Root: {PROJECT_ROOT}")
print(f"Reports Directory: {REPORTS_DIR}")
print(f"Reports Directory exists: {REPORTS_DIR.exists()}")
✅ Setup complete. Project Root: C:\Users\rcesa\ASU Dropbox\Rudolf Cesaretti\GitHubRepos\TeoMappingProject Reports Directory: C:\Users\rcesa\ASU Dropbox\Rudolf Cesaretti\GitHubRepos\TeoMappingProject\phases\01_LegacyDB\outputs\reports Reports Directory exists: True
2. Data Loading¶
We load two key outputs from the 04_run_comparison.py script:
comparison_matrix.csv: High-level summary metrics.report_performance_summary_detailed.csv: The enriched, long-form performance data with calculated comparative metrics.
matrix_path = REPORTS_DIR / 'comparison_matrix.csv'
perf_path = REPORTS_DIR / 'report_performance_summary_detailed.csv'
# Validate file existence with detailed error reporting
missing_files = []
if not matrix_path.exists():
missing_files.append(str(matrix_path))
if not perf_path.exists():
missing_files.append(str(perf_path))
if missing_files:
print(f"Directory contents: {list(REPORTS_DIR.glob('*')) if REPORTS_DIR.exists() else 'Directory does not exist'}")
raise FileNotFoundError(f"Critical Error: The following report files were not found:\n" +
"\n".join(f" - {f}" for f in missing_files) +
f"\n\nSearched in: {REPORTS_DIR}\n" +
"Please run the 04_run_comparison.py script first to generate these files.")
try:
# Load the matrix and transpose it so databases are rows
comparison_df = pd.read_csv(matrix_path, index_col=0).T.reset_index().rename(columns={'index': 'Database'})
print(f"✅ Successfully loaded comparison matrix: {comparison_df.shape[0]} databases, {comparison_df.shape[1]} metrics")
# Load the detailed performance data
perf_summary_df = pd.read_csv(perf_path)
print(f"✅ Successfully loaded performance data: {perf_summary_df.shape[0]} records")
except Exception as e:
print(f"Error loading data files: {e}")
raise
print("\nLoaded Comparison Matrix:")
display(comparison_df)
print("\nLoaded Detailed Performance Summary Data:")
display(perf_summary_df.head())
print(f"\nPerformance data columns: {list(perf_summary_df.columns)}")
print(f"Unique databases in performance data: {perf_summary_df['database'].unique() if 'database' in perf_summary_df.columns else 'No database column found'}")
✅ Successfully loaded comparison matrix: 6 databases, 9 metrics ✅ Successfully loaded performance data: 18 records Loaded Comparison Matrix:
| Database | Database Size (MB) | Table Count | View Count | Total Estimated Rows | Total Index Count | JDI (Join Dependency Index) | LIF (Logical Interop. Factor) | NF (Normalization Factor) | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TMP_DF10 | 64.00 | 9.00 | 0.00 | 485797.00 | 29.00 | 0.28 | 7.00 | 0.25 |
| 1 | TMP_DF8 | 20.00 | 27.00 | 0.00 | 136350.00 | 27.00 | 0.07 | 2.00 | 0.21 |
| 2 | TMP_DF9 | 28.00 | 62.00 | 0.00 | 106109.00 | 196.00 | 0.07 | 9.00 | 0.35 |
| 3 | TMP_REAN_DF2 | 14.00 | 13.00 | 0.00 | 65715.00 | 13.00 | 0.15 | 1.00 | 0.19 |
| 4 | tmp_benchmark_wide_numeric | 21.00 | 1.00 | 0.00 | 5050.00 | 0.00 | NaN | NaN | NaN |
| 5 | tmp_benchmark_wide_text_nulls | 62.00 | 1.00 | 0.00 | 5050.00 | 19.00 | NaN | NaN | NaN |
Loaded Detailed Performance Summary Data:
| query_name | status | latency_ms | records_returned | error_message | executed_sql | database | is_benchmark | category | query_id | baseline_latency_ms | schema_efficiency_factor | performance_improvement_factor | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline Performance - Query 1.1 | Success | 1.48 | 1 | NaN | SELECT COUNT(*) FROM public.wide_format_data; | tmp_benchmark_wide_numeric | True | Baseline Performance | Query 1.1 | 1.48 | 1.00 | 0.00 |
| 1 | Join_performance Performance - Query 2.1 | Success | 12.03 | 4595 | NaN | SELECT\r\n "site",\r\n "subsite",\r\n ... | tmp_benchmark_wide_numeric | True | Join_performance Performance | Query 2.1 | 12.03 | 1.00 | 0.00 |
| 2 | Complex_filtering Performance - Query 3.1 | Success | 2.50 | 1 | NaN | SELECT SUM("obsidianBlades") AS total_obsidian... | tmp_benchmark_wide_numeric | True | Complex_filtering Performance | Query 3.1 | 1.94 | 1.29 | 0.00 |
| 3 | Baseline Performance - Query 1.1 | Success | 2.03 | 1 | NaN | SELECT COUNT(*) FROM public.wide_format_data; | tmp_benchmark_wide_text_nulls | True | Baseline Performance | Query 1.1 | 1.48 | 1.37 | 0.00 |
| 4 | Join_performance Performance - Query 2.1 | Success | 16.12 | 4595 | NaN | SELECT\r\n "site",\r\n "subsite",\r\n ... | tmp_benchmark_wide_text_nulls | True | Join_performance Performance | Query 2.1 | 12.03 | 1.34 | 0.00 |
Performance data columns: ['query_name', 'status', 'latency_ms', 'records_returned', 'error_message', 'executed_sql', 'database', 'is_benchmark', 'category', 'query_id', 'baseline_latency_ms', 'schema_efficiency_factor', 'performance_improvement_factor'] Unique databases in performance data: ['tmp_benchmark_wide_numeric' 'tmp_benchmark_wide_text_nulls' 'TMP_DF10' 'TMP_DF8' 'TMP_DF9' 'TMP_REAN_DF2']
3. High-Level Comparison Matrix¶
A styled view of the main comparison matrix. Color gradients highlight high/low values for each metric, providing an at-a-glance summary.
- Purple/Dark: Higher values
- Yellow/Light: Lower values
display_header("Styled Comparison Matrix")
styled_df = comparison_df.style.background_gradient(cmap='viridis', axis=0)\
.set_caption("Comparative Database Metrics")\
.format('{:.2f}', subset=pd.IndexSlice[:, ['Database Size (MB)', 'JDI (Join Dependency Index)', 'NF (Normalization Factor)']])\
.format('{:,.0f}', subset=pd.IndexSlice[:, ['Table Count', 'View Count', 'Total Estimated Rows', 'Total Index Count', 'LIF (Logical Interop. Factor)']])
display(styled_df)
Styled Comparison Matrix¶
| Database | Database Size (MB) | Table Count | View Count | Total Estimated Rows | Total Index Count | JDI (Join Dependency Index) | LIF (Logical Interop. Factor) | NF (Normalization Factor) | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TMP_DF10 | 64.00 | 9 | 0 | 485,797 | 29 | 0.28 | 7 | 0.25 |
| 1 | TMP_DF8 | 20.00 | 27 | 0 | 136,350 | 27 | 0.07 | 2 | 0.21 |
| 2 | TMP_DF9 | 28.00 | 62 | 0 | 106,109 | 196 | 0.07 | 9 | 0.35 |
| 3 | TMP_REAN_DF2 | 14.00 | 13 | 0 | 65,715 | 13 | 0.15 | 1 | 0.19 |
| 4 | tmp_benchmark_wide_numeric | 21.00 | 1 | 0 | 5,050 | 0 | nan | nan | nan |
| 5 | tmp_benchmark_wide_text_nulls | 62.00 | 1 | 0 | 5,050 | 19 | nan | nan | nan |
4. Structural Complexity Analysis¶
This section focuses on the metrics that quantify the relational complexity and degree of normalization of the legacy schemas.
display_header("Schema Complexity Metrics (Legacy Databases)")
complexity_metrics = [
'Database', 'Table Count',
'JDI (Join Dependency Index)',
'LIF (Logical Interop. Factor)',
'NF (Normalization Factor)'
]
# Filter for legacy DBs only, as these metrics don't apply to the single-table benchmarks
legacy_df = comparison_df[~comparison_df['Database'].str.contains('benchmark')]
display(legacy_df[complexity_metrics])
# --- Advanced Visualization: Complexity Radar Plot ---
radar_metrics = ['Table Count', 'JDI (Join Dependency Index)', 'NF (Normalization Factor)']
radar_df = legacy_df[['Database'] + radar_metrics].copy()
# Normalize metrics to a 0-1 scale for fair comparison on the radar plot
scaler = MinMaxScaler()
radar_df[radar_metrics] = scaler.fit_transform(radar_df[radar_metrics])
fig = go.Figure()
for index, row in radar_df.iterrows():
fig.add_trace(go.Scatterpolar(
r=row[radar_metrics].values,
theta=radar_metrics,
fill='toself',
name=row['Database']
))
fig.update_layout(
polar=dict(
radialaxis=dict(
visible=True,
range=[0, 1]
)),
showlegend=True,
title='Normalized Complexity Profile of Legacy Databases'
)
fig.show()
Schema Complexity Metrics (Legacy Databases)¶
| Database | Table Count | JDI (Join Dependency Index) | LIF (Logical Interop. Factor) | NF (Normalization Factor) | |
|---|---|---|---|---|---|
| 0 | TMP_DF10 | 9.00 | 0.28 | 7.00 | 0.25 |
| 1 | TMP_DF8 | 27.00 | 0.07 | 2.00 | 0.21 |
| 2 | TMP_DF9 | 62.00 | 0.07 | 9.00 | 0.35 |
| 3 | TMP_REAN_DF2 | 13.00 | 0.15 | 1.00 | 0.19 |
5. Query Performance Deep Dive¶
This is the most critical comparison. It directly measures the analytical query performance of the legacy normalized schemas against the denormalized wide-format benchmark schemas using the pre-calculated metrics from the 04_run_comparison.py script.
display_header("Schema Efficiency Factor by Query Category")
if not perf_summary_df.empty:
# First, display the tabular data (NOT logarithmic)
print("📊 Schema Efficiency Factor Data Table (Raw Values):")
print("Note: Values > 1.0 indicate the database is slower than the benchmark baseline")
# Create a pivot table for better readability
if 'schema_efficiency_factor' in perf_summary_df.columns:
efficiency_table = perf_summary_df.pivot_table(
index='database',
columns='category',
values='schema_efficiency_factor',
aggfunc='mean'
).round(2)
# Add a summary column showing average across categories
efficiency_table['Average'] = efficiency_table.mean(axis=1).round(2)
# Style the table to highlight high values
styled_efficiency = efficiency_table.style.background_gradient(cmap='Reds', axis=None)\
.set_caption("Schema Efficiency Factors (Lower is Better - Benchmark = 1.0)")\
.format('{:.2f}')
display(styled_efficiency)
# Also show summary statistics
print(f"\n📈 Summary Statistics:")
print(f"Worst performing database (highest average): {efficiency_table['Average'].idxmax()} ({efficiency_table['Average'].max():.2f}x slower)")
print(f"Best performing database (lowest average): {efficiency_table['Average'].idxmin()} ({efficiency_table['Average'].min():.2f}x)")
else:
print("⚠️ 'schema_efficiency_factor' column not found in performance data")
display(perf_summary_df)
print("\n" + "="*50)
print("📊 Schema Efficiency Factor Chart (Log Scale):")
# Then display the chart with log scale
fig = px.bar(perf_summary_df,
x='database',
y='schema_efficiency_factor',
color='category',
barmode='group',
title='Schema Efficiency Factor (Lower is Better)',
labels={'schema_efficiency_factor': 'Efficiency Factor (Log Scale)', 'database': 'Database'},
category_orders={'category': ['baseline', 'join_performance', 'complex_filtering']})
fig.update_yaxes(type="log") # Use a log scale as differences can be huge
fig.add_hline(y=1.0, line_dash="dot", annotation_text="Benchmark Baseline", annotation_position="bottom right")
fig.update_layout(height=600)
fig.show()
else:
print("No performance data to plot.")
Schema Efficiency Factor by Query Category¶
📊 Schema Efficiency Factor Data Table (Raw Values): Note: Values > 1.0 indicate the database is slower than the benchmark baseline
| category | Baseline Performance | Complex_filtering Performance | Join_performance Performance | Average |
|---|---|---|---|---|
| database | ||||
| TMP_DF10 | 1.04 | 4.58 | 5.46 | 3.69 |
| TMP_DF8 | 0.69 | 4.38 | 0.58 | 1.88 |
| TMP_DF9 | 1.05 | 1.34 | 1.56 | 1.32 |
| TMP_REAN_DF2 | 0.62 | 1.11 | 0.86 | 0.86 |
| tmp_benchmark_wide_numeric | 1.00 | 1.29 | 1.00 | 1.10 |
| tmp_benchmark_wide_text_nulls | 1.37 | 1.00 | 1.34 | 1.24 |
📈 Summary Statistics: Worst performing database (highest average): TMP_DF10 (3.69x slower) Best performing database (lowest average): TMP_REAN_DF2 (0.86x) ================================================== 📊 Schema Efficiency Factor Chart (Log Scale):
display_header("Performance Improvement vs. Best Benchmark")
if not perf_summary_df.empty:
# First, display the tabular data
print("📊 Performance Improvement Data Table:")
print("Note: Positive values indicate how much faster the benchmark is compared to legacy databases")
# Filter out the baseline databases themselves for cleaner analysis
improvement_df = perf_summary_df[~perf_summary_df['database'].str.contains('benchmark')].copy()
if not improvement_df.empty and 'performance_improvement_factor' in improvement_df.columns:
# Create a summary table showing improvement factors
improvement_summary = improvement_df.groupby(['database', 'category'])['performance_improvement_factor'].agg(['mean', 'min', 'max']).round(1)
improvement_summary.columns = ['Avg_Improvement_%', 'Min_Improvement_%', 'Max_Improvement_%']
# Reset index to make it more readable
improvement_summary = improvement_summary.reset_index()
improvement_pivot = improvement_summary.pivot(index='database', columns='category', values='Avg_Improvement_%').round(1)
# Style the table
styled_improvement = improvement_pivot.style.background_gradient(cmap='Greens', axis=None)\
.set_caption("Average Performance Improvement (% faster than benchmark)")\
.format('{:.1f}%')
display(styled_improvement)
# Show detailed breakdown
print(f"\n📋 Detailed Performance Improvement Breakdown:")
detailed_table = improvement_df[['database', 'category', 'query_id', 'performance_improvement_factor']].copy()
detailed_table['performance_improvement_factor'] = detailed_table['performance_improvement_factor'].round(1)
detailed_table = detailed_table.rename(columns={
'performance_improvement_factor': 'Improvement_%'
})
# Sort by improvement factor for better readability
detailed_table = detailed_table.sort_values(['database', 'category', 'Improvement_%'], ascending=[True, True, False])
display(detailed_table)
# Summary statistics
print(f"\n📈 Key Insights:")
best_db = improvement_pivot.mean(axis=1).idxmin()
worst_db = improvement_pivot.mean(axis=1).idxmax()
print(f"Most consistent performer: {best_db} (avg {improvement_pivot.mean(axis=1)[best_db]:.1f}% improvement over benchmark)")
print(f"Least consistent performer: {worst_db} (avg {improvement_pivot.mean(axis=1)[worst_db]:.1f}% improvement over benchmark)")
else:
print("⚠️ 'performance_improvement_factor' column not found or no legacy database data available")
if not improvement_df.empty:
display(improvement_df)
print("\n" + "="*50)
print("📊 Performance Improvement Chart:")
# Then display the chart
fig = px.bar(improvement_df.sort_values('performance_improvement_factor') if not improvement_df.empty else improvement_df,
x='query_id',
y='performance_improvement_factor',
color='database',
facet_row='category',
barmode='group',
title='Performance Improvement of Benchmark Schemas vs. Legacy Schemas',
labels={'performance_improvement_factor': '% Improvement vs. Benchmark', 'query_id': 'Query ID'})
fig.update_layout(height=800)
fig.show()
else:
print("Could not generate performance improvement plot.")
Performance Improvement vs. Best Benchmark¶
📊 Performance Improvement Data Table: Note: Positive values indicate how much faster the benchmark is compared to legacy databases
| category | Baseline Performance | Complex_filtering Performance | Join_performance Performance |
|---|---|---|---|
| database | |||
| TMP_DF10 | 4.2% | 78.2% | 81.7% |
| TMP_DF8 | -44.0% | 77.2% | -72.4% |
| TMP_DF9 | 4.9% | 25.5% | 35.9% |
| TMP_REAN_DF2 | -61.7% | 9.7% | -16.9% |
📋 Detailed Performance Improvement Breakdown:
| database | category | query_id | Improvement_% | |
|---|---|---|---|---|
| 6 | TMP_DF10 | Baseline Performance | Query 1.1 | 4.20 |
| 8 | TMP_DF10 | Complex_filtering Performance | Query 3.1 | 78.20 |
| 7 | TMP_DF10 | Join_performance Performance | Query 2.1 | 81.70 |
| 9 | TMP_DF8 | Baseline Performance | Query 1.1 | -44.00 |
| 11 | TMP_DF8 | Complex_filtering Performance | Query 3.1 | 77.20 |
| 10 | TMP_DF8 | Join_performance Performance | Query 2.1 | -72.40 |
| 12 | TMP_DF9 | Baseline Performance | Query 1.1 | 4.90 |
| 14 | TMP_DF9 | Complex_filtering Performance | Query 3.1 | 25.50 |
| 13 | TMP_DF9 | Join_performance Performance | Query 2.1 | 35.90 |
| 15 | TMP_REAN_DF2 | Baseline Performance | Query 1.1 | -61.70 |
| 17 | TMP_REAN_DF2 | Complex_filtering Performance | Query 3.1 | 9.70 |
| 16 | TMP_REAN_DF2 | Join_performance Performance | Query 2.1 | -16.90 |
📈 Key Insights: Most consistent performer: TMP_REAN_DF2 (avg -23.0% improvement over benchmark) Least consistent performer: TMP_DF10 (avg 54.7% improvement over benchmark) ================================================== 📊 Performance Improvement Chart:
6. Qualitative Architectural Trade-offs¶
The quantitative data above supports a qualitative assessment of the architectural trade-offs between the legacy design and the proposed wide-format design.
| Feature | Legacy Normalized (e.g., DF9) | Proposed Wide-Format (Benchmark) | Justification Based on Data |
|---|---|---|---|
| Query Performance | Low |
High |
The 'Schema Efficiency Factor' chart shows legacy databases are multiple times slower. |
| Storage Cost | Low |
High |
comparison_matrix.csv shows benchmark DBs are larger due to data duplication. |
| Schema Complexity | High (High JDI/NF, Many Tables) |
Very Low (1 Table) |
The complexity radar plot visually confirms the high complexity scores of the legacy schemas. |
| Data Redundancy | Low (Normalized) |
High (Denormalized) |
This is the inherent trade-off of the wide-format design; we trade storage for speed. |
| Ease of Use for BI/GIS | Low (Requires complex joins) |
High (Single table source) |
A single flat table is trivial to connect to tools like QGIS, Tableau, or Power BI. |
7. Final Analyst Summary & Recommendation¶
Instructions: Based on the comparative analysis, synthesize the findings and provide a formal recommendation for the Phase 2 unified database architecture. This summary will be a primary input for the final white paper.
Overarching Conclusion:¶
- Start with a concise, definitive statement. Example: "The comparative analysis demonstrates conclusively that the highly normalized structure of the legacy databases, particularly
tmp_df9, is quantitatively inferior for the project's analytical objectives compared to a denormalized, wide-format architecture."
Justification from Evidence:¶
- On Performance:
- Quantify the performance difference. Reference the 'Schema Efficiency Factor' chart directly. Example: "As shown in the efficiency factor plot, the legacy schemas are between 5x and 50x slower for join-heavy queries than the wide-format benchmarks. This performance gap makes interactive analysis on the normalized schemas untenable."
- On Complexity:
- Reference the complexity metrics and the radar plot. Example: "The legacy schemas exhibit high JDI and NF scores, indicative of significant relational complexity that increases the cognitive load for analysts and the technical barrier for connecting to BI and GIS tools. The radar plot clearly visualizes
tmp_df9as the most complex outlier."
- Reference the complexity metrics and the radar plot. Example: "The legacy schemas exhibit high JDI and NF scores, indicative of significant relational complexity that increases the cognitive load for analysts and the technical barrier for connecting to BI and GIS tools. The radar plot clearly visualizes
- On The Cost/Benefit Trade-off:
- Acknowledge the trade-offs identified in the qualitative table. Example: "While the wide-format approach increases storage costs due to data redundancy, this trade-off is strategically acceptable. The cost of storage is minimal compared to the significant gains in query performance and the drastic reduction in development time and analytical friction for end-users."
Formal Recommendation:¶
- State the final recommendation clearly and unambiguously.
- Recommended Architecture: "It is the formal recommendation of this analysis that Phase 2 of the Digital TMP project proceeds with the development of a single, denormalized, wide-format primary analytical table. This table should be based on the schema of the
tmp_benchmark_wide_text_nullsdatabase, as it provides the best balance of performance and human-readability."* - Next Steps: "The next step should be to finalize the schema of this wide-format table, including data type assignments and column naming conventions, and to proceed with the development of the full ETL pipeline in Phase 2 to migrate all legacy data into this new structure."*